MSDS 7331 - Lab 1 - Baseball (Lehmans's Dataset)¶


Team - Triston Hudgins, Shijo Joseph, Osman Kanteh, Douglas Yip

In [1]:
## Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
import plotly.express as px

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import scatter
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, layout,XAxis, YAxis, Bar, Line
%matplotlib inline

Business Understanding (10 points total). Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific¶

The reason why we selected Lehman's baseball dataset is to understand the difficult decision a Major League Baseball (MLB) General Manager (GM) has to field a competitive team of talent while balancing that the fall below the team's salary cap. Salary caps in baseball are placed to reduce anti-competitive behavior in the league. This creates guardrails and fairness on how contracts are offered to players. Teams that chose to spend more than the salary cap are penalized with "Competitive Balance Tax" (CBT). Teams are asses a 20% tax for their first season above the salary cap and tax rate become more punitive for every consective year above the salary cap.

The performance of a player are usually rewarded contracts of varying degrees. In this analysis, we will see how a players offensive stats can predict the outcome of player's salary. We will categorize the salaries from Low to Elite to measure the effectives of our LDA prediction model.

Sources

  • http://origin.mlb.com/glossary/transactions/competitive-balance-tax
  • https://bleacherreport.com/articles/32306-open-mic-why-baseball-gms-have-the-most-difficult-job

Loading the data sets Player offensive stats and salaries.¶
In [2]:
# load the Lehman's baseball Batting dataset
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331_lab1/main/Batting.csv') # read in the csv file
df = df[(df['yearID'] >= 2000) & (df['yearID'] <= 2015)]
df.head()
Out[2]:
playerID yearID stint teamID lgID G AB R H 2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
79265 abbotje01 2000 1 CHA AL 80 215 31 59 15 ... 29.0 2.0 1.0 21 38.0 1.0 2.0 2.0 1.0 2.0
79266 abbotku01 2000 1 NYN NL 79 157 22 34 7 ... 12.0 1.0 1.0 14 51.0 2.0 1.0 0.0 1.0 2.0
79267 abbotpa01 2000 1 SEA AL 35 5 1 2 1 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 1.0 0.0 0.0
79268 abreubo01 2000 1 PHI NL 154 576 103 182 42 ... 79.0 28.0 8.0 100 116.0 9.0 1.0 0.0 3.0 12.0
79269 aceveju01 2000 1 MIL NL 62 1 1 0 0 ... 0.0 0.0 0.0 1 1.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

In [3]:
# load the Lehman's baseball salary dataset

df_salary = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331_lab1/main/Salaries.csv') # read in the csv file
df_salary = df_salary[(df_salary['yearID'] >= 2000) & (df_salary['yearID'] <= 2015)]
df_salary.head()
Out[3]:
yearID teamID lgID playerID salary
12263 2000 ANA AL anderga01 3250000
12264 2000 ANA AL belchti01 4600000
12265 2000 ANA AL botteke01 4000000
12266 2000 ANA AL clemeed02 215000
12267 2000 ANA AL colanmi01 200000

Merging of the data sets will help us get a better picture of a players salary and their baseball stats

In [4]:
df = pd.merge(df,df_salary[['playerID','yearID','teamID','salary']],on=['playerID','yearID','teamID'], how='left')
df.head()
Out[4]:
playerID yearID stint teamID lgID G AB R H 2B ... SB CS BB SO IBB HBP SH SF GIDP salary
0 abbotje01 2000 1 CHA AL 80 215 31 59 15 ... 2.0 1.0 21 38.0 1.0 2.0 2.0 1.0 2.0 255000.0
1 abbotku01 2000 1 NYN NL 79 157 22 34 7 ... 1.0 1.0 14 51.0 2.0 1.0 0.0 1.0 2.0 500000.0
2 abbotpa01 2000 1 SEA AL 35 5 1 2 1 ... 0.0 0.0 0 1.0 0.0 0.0 1.0 0.0 0.0 285000.0
3 abreubo01 2000 1 PHI NL 154 576 103 182 42 ... 28.0 8.0 100 116.0 9.0 1.0 0.0 3.0 12.0 2933333.0
4 aceveju01 2000 1 MIL NL 62 1 1 0 0 ... 0.0 0.0 1 1.0 0.0 0.0 0.0 0.0 0.0 612500.0

5 rows × 23 columns

[10 Points] Simple Statistics - Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful.¶

In [5]:
#following code will describe the data
df.describe()
Out[5]:
yearID stint G AB R H 2B 3B HR RBI SB CS BB SO IBB HBP SH SF GIDP salary
count 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 22083.000000 1.284700e+04
mean 2007.608251 1.086854 50.089888 120.486302 16.074718 31.480822 6.287778 0.656478 3.622832 15.303446 2.055291 0.825341 11.353892 24.252276 0.889236 1.219309 1.123625 0.971109 2.738532 3.073175e+06
std 4.632573 0.297273 45.772948 180.721109 26.868157 50.531427 10.524455 1.582814 7.451858 26.456559 5.786324 1.977861 20.081184 35.650176 2.697024 2.641271 2.326378 1.890580 4.750978 4.181921e+06
min 2000.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.655740e+05
25% 2004.000000 1.000000 13.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.293000e+05
50% 2008.000000 1.000000 33.000000 18.000000 1.000000 3.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.015000e+06
75% 2012.000000 1.000000 73.000000 178.000000 21.000000 44.000000 8.000000 1.000000 3.000000 19.000000 1.000000 1.000000 14.000000 36.000000 0.000000 1.000000 1.000000 1.000000 4.000000 4.000000e+06
max 2015.000000 4.000000 163.000000 716.000000 152.000000 262.000000 59.000000 23.000000 73.000000 160.000000 78.000000 24.000000 232.000000 223.000000 120.000000 30.000000 24.000000 16.000000 32.000000 3.300000e+07

In the 15 year data set that we selected from the player stats, we collect over 22,000 rows of players offensive stats. The interesting element of these stats is that 75% of the players to the max has an extremely great range. For example, RBIs 75% of the players had 19 or less RBIs but the max is 160 RBIs or the number of hits made by 75% of the players were 44 but had a max of 262 hits in a season. With salary 75% of the players only made up to 4 Million with the max of 33 Million for one season. What we know from these stats is that baseball is a sport that is not easy sport and for one to excel and achieve annual salary in the top 25 percentile 4-33 Million/year, a player will be required to at least perform better the the 75 percentile of players to get a high paying contract.


[10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.¶

In [6]:
print (df.dtypes)
print (df.info())
playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
salary      float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22083 entries, 0 to 22082
Data columns (total 23 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   playerID  22083 non-null  object 
 1   yearID    22083 non-null  int64  
 2   stint     22083 non-null  int64  
 3   teamID    22083 non-null  object 
 4   lgID      22083 non-null  object 
 5   G         22083 non-null  int64  
 6   AB        22083 non-null  int64  
 7   R         22083 non-null  int64  
 8   H         22083 non-null  int64  
 9   2B        22083 non-null  int64  
 10  3B        22083 non-null  int64  
 11  HR        22083 non-null  int64  
 12  RBI       22083 non-null  float64
 13  SB        22083 non-null  float64
 14  CS        22083 non-null  float64
 15  BB        22083 non-null  int64  
 16  SO        22083 non-null  float64
 17  IBB       22083 non-null  float64
 18  HBP       22083 non-null  float64
 19  SH        22083 non-null  float64
 20  SF        22083 non-null  float64
 21  GIDP      22083 non-null  float64
 22  salary    12847 non-null  float64
dtypes: float64(10), int64(10), object(3)
memory usage: 4.0+ MB
None

In the 15 year data set that we selected from the player stats, we collect over 22,000 rows of players offensive stats. The interesting element of these stats is that 75% of the players to the max has an extremely great range. For example, RBIs 75% of the players had 19 or less RBIs but the max is 160 RBIs or the number of hits made by 75% of the players were 44 but had a max of 262 hits in a season. With salary 75% of the players only made up to 4 Million with the max of 33 Million for one season. What we know from these stats is that baseball is a sport that is not easy sport and for one to excel and achieve annual salary in the top 25 percentile 4-33 Million/year, a player will be required to at least perform better the the 75 percentile of players to get a high paying contract.

Summary of values¶

A total of +22,000 offsensive players were recorded in the dataset from 2000 - 2015. The following full season statistics (columns of continous variables) will be used

  • G:- Games played
  • AB:- Number of plate apearances
  • R:- Number of times a player scores
  • H:- Number of hits
  • 2B:- Number of doubles
  • 3B:- Number of triples
  • HR:- Number of home runs
  • RBI:- Number of Runs Batted In
  • SB:- Number of stolen bases
  • CS:- Number of times caught stealing
  • BB:- Number of base on balls (walks)
  • SO:- Number of strike outs
  • IBB:- Number of intentional base on balls (walks)
  • HBP:- Number of times hit by pitch
  • SH:- Number of sacrifice hits. Recorded when player is able to advance the runner
  • SB:- Number of sacrifice bunts. Recorded when player is able to advance the runner
  • SF:- Number of sacrifice fly. Recorded when player is able to advance the runner and run scores
  • GIDP:- Number of Grounded in Double Play. Recorded when player inability to hit that results into two outs.
  • Salary:- Players earnings in the season

    Categorical Varialbes definition butwill not be used in analysis

  • playerID:- Unique Identifier of player
  • teamID:- Unique Identifier of team
  • lgID:- Unique Identifier which league they play in
In [7]:
display(df.shape)
(22083, 23)

The dataset has 22,083 rows and 23 columns


[15 points] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.¶

In [8]:
#plotting the games vs. salary to gain an understanding of the data
#df.plot(kind="scatter",x="salary",y="G")
px.scatter(df,
           x="salary", y="G",
           title= "Salary by Number of Games Played",
           labels={'salary': 'Salary',
                   'G': 'Number of Games'})
In [9]:
#plotting salary vs player - Difficult to read.  Long load time.  Is it necessary?
#df.salary.plot.bar()
In [10]:
#check for NA
df.isnull().sum()
Out[10]:
playerID       0
yearID         0
stint          0
teamID         0
lgID           0
G              0
AB             0
R              0
H              0
2B             0
3B             0
HR             0
RBI            0
SB             0
CS             0
BB             0
SO             0
IBB            0
HBP            0
SH             0
SF             0
GIDP           0
salary      9236
dtype: int64

Salaries are null for 9236 records.

In [11]:
# Any missing values in the dataset
def plot_missingness(df: pd.DataFrame=df) -> None:
    nan_df = pd.DataFrame(df.isna().sum()).reset_index()
    nan_df.columns  = ['Column', 'NaN_Count']
    nan_df['NaN_Count'] = nan_df['NaN_Count'].astype('int')
    nan_df['NaN_%'] = round(nan_df['NaN_Count']/df.shape[0] * 100,1)
    nan_df['Type']  = 'Missingness'
    nan_df.sort_values('NaN_%', inplace=True)

    # Add completeness
    for i in range(nan_df.shape[0]):
        complete_df = pd.DataFrame([nan_df.loc[i,'Column'],df.shape[0] - nan_df.loc[i,'NaN_Count'],100 - nan_df.loc[i,'NaN_%'], 'Completeness']).T
        complete_df.columns  = ['Column','NaN_Count','NaN_%','Type']
        complete_df['NaN_%'] = complete_df['NaN_%'].astype('int')
        complete_df['NaN_Count'] = complete_df['NaN_Count'].astype('int')
        nan_df = pd.concat([nan_df,complete_df], sort=True)
            
    nan_df = nan_df.rename(columns={"Column": "Feature", "NaN_%": "Missing %"})

    # Missingness Plot
    fig = px.bar(nan_df,
                 x='Feature',
                 y='Missing %',
                 title=f"Missingness Plot (N={df.shape[0]})",
                 color='Type',
                 opacity = 0.6,
                 color_discrete_sequence=['red','#808080'],
                 width=800,
                 height=400)
    fig.show()

plot_missingness(df)

We want to see what type of players don't have salary details.

In [12]:
null_data = df[df.isnull().any(axis=1)]
display(null_data)

ax = null_data.boxplot(column=['G', 'AB','H'])
ax.set_yscale('log')
playerID yearID stint teamID lgID G AB R H 2B ... SB CS BB SO IBB HBP SH SF GIDP salary
8 alcanis01 2000 1 BOS AL 21 45 9 13 1 ... 0.0 0.0 3 7.0 0.0 0.0 0.0 0.0 0.0 NaN
14 allench01 2000 1 MIN AL 15 50 2 15 3 ... 0.0 2.0 3 14.0 0.0 1.0 0.0 1.0 1.0 NaN
15 allendu01 2000 1 SDN NL 9 12 0 0 0 ... 0.0 0.0 2 5.0 0.0 0.0 0.0 0.0 1.0 NaN
16 allendu01 2000 2 DET AL 18 16 5 7 2 ... 0.0 0.0 2 7.0 0.0 0.0 0.0 0.0 0.0 NaN
22 alvarcl01 2000 1 PHI NL 2 5 1 1 0 ... 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22069 ynoara01 2015 1 COL NL 72 127 14 33 8 ... 1.0 0.0 3 28.0 0.0 0.0 1.0 0.0 2.0 NaN
22074 younger03 2015 2 NYN NL 18 8 9 0 0 ... 3.0 2.0 0 1.0 0.0 1.0 0.0 0.0 0.0 NaN
22078 zitoba01 2015 1 OAK AL 3 0 0 0 0 ... 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
22080 zobribe01 2015 2 KCA AL 59 232 37 66 16 ... 2.0 3.0 29 30.0 1.0 1.0 0.0 2.0 3.0 NaN
22082 zychto01 2015 1 SEA AL 13 0 0 0 0 ... 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 NaN

9236 rows × 23 columns

Based on the box 75% of the Players with no salary have played less than 75 games or has less than 75 at bats or 10 hits. Although the rest may have significant playing time, we cannot imput the data since the salary of contracts must be manually inputted. As a result of this project, we will remove the rows.

In [13]:
print("Number of Rows before removing:", len(df))
df_clean = df.dropna()
print("Total number of rows after removing the rows with missing values:",len(df_clean))
Number of Rows before removing: 22083
Total number of rows after removing the rows with missing values: 12847

Removing the rows with missing values¶

In [14]:
df_clean.dtypes
Out[14]:
playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
salary      float64
dtype: object
In [15]:
# observed that some player IDs had multiple entires in the same year
df_clean = df_clean.groupby(['playerID', 'yearID'], as_index=False).sum()
print (df_clean)
        playerID  yearID  stint    G   AB   R   H  2B  3B  HR  ...   SB   CS  \
0      aardsda01    2004      1   11    0   0   0   0   0   0  ...  0.0  0.0   
1      aardsda01    2007      1   25    0   0   0   0   0   0  ...  0.0  0.0   
2      aardsda01    2008      1   47    1   0   0   0   0   0  ...  0.0  0.0   
3      aardsda01    2009      1   73    0   0   0   0   0   0  ...  0.0  0.0   
4      aardsda01    2010      1   53    0   0   0   0   0   0  ...  0.0  0.0   
...          ...     ...    ...  ...  ...  ..  ..  ..  ..  ..  ...  ...  ...   
12838  zumayjo01    2008      1   21    0   0   0   0   0   0  ...  0.0  0.0   
12839  zumayjo01    2009      1   29    0   0   0   0   0   0  ...  0.0  0.0   
12840  zumayjo01    2010      1   31    0   0   0   0   0   0  ...  0.0  0.0   
12841  zuninmi01    2014      1  131  438  51  87  20   2  22  ...  0.0  3.0   
12842  zuninmi01    2015      1  112  350  28  61  11   0  11  ...  0.0  1.0   

       BB     SO  IBB   HBP   SH   SF  GIDP     salary  
0       0    0.0  0.0   0.0  0.0  0.0   0.0   300000.0  
1       0    0.0  0.0   0.0  0.0  0.0   0.0   387500.0  
2       0    1.0  0.0   0.0  0.0  0.0   0.0   403250.0  
3       0    0.0  0.0   0.0  0.0  0.0   0.0   419000.0  
4       0    0.0  0.0   0.0  0.0  0.0   0.0  2750000.0  
...    ..    ...  ...   ...  ...  ...   ...        ...  
12838   0    0.0  0.0   0.0  0.0  0.0   0.0   420000.0  
12839   0    0.0  0.0   0.0  0.0  0.0   0.0   735000.0  
12840   0    0.0  0.0   0.0  0.0  0.0   0.0   915000.0  
12841  17  158.0  1.0  17.0  0.0  4.0  12.0   504100.0  
12842  21  132.0  0.0   5.0  8.0  2.0   6.0   523500.0  

[12843 rows x 21 columns]
In [16]:
#reviewing the years included in the dataset

df_clean.yearID.value_counts()
Out[16]:
2015    814
2001    814
2008    812
2011    812
2012    811
2013    809
2007    806
2002    804
2014    801
2004    800
2003    799
2000    798
2005    796
2010    790
2006    790
2009    787
Name: yearID, dtype: int64
In [17]:
player_counts = []
for playerID in df_clean['playerID']:
    player_counts.append(list(df_clean['playerID']).count(playerID))
    
print('The average number of times a playerID appears is {:0.4f}'.format(np.mean(player_counts)))
The average number of times a playerID appears is 6.8127
In [18]:
df_clean.duplicated()
Out[18]:
0        False
1        False
2        False
3        False
4        False
         ...  
12838    False
12839    False
12840    False
12841    False
12842    False
Length: 12843, dtype: bool
In [19]:
#drop columns with insignificant values
# Drop the non-beneficial ID columns, 'EIN' and 'NAME'.
df_clean = df_clean.drop(columns=['playerID'])
df_clean.head()
Out[19]:
yearID stint G AB R H 2B 3B HR RBI SB CS BB SO IBB HBP SH SF GIDP salary
0 2004 1 11 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 300000.0
1 2007 1 25 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 387500.0
2 2008 1 47 1 0 0 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 403250.0
3 2009 1 73 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 419000.0
4 2010 1 53 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 2750000.0
In [20]:
# For detecting outliers I will use LocalOutlierFactor. I will use default values of 20 and 'auto'.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
clf=LocalOutlierFactor(n_neighbors=20, contamination='auto')
clf.fit_predict(df_clean)
df_scores=clf.negative_outlier_factor_
df_scores= np.sort(df_scores)
df_scores[0:20]
Out[20]:
array([-2553.55270262, -2104.40459025, -2004.77421899, -1949.86900436,
       -1861.75941593, -1634.28817647, -1487.24792154, -1392.28149387,
       -1373.94061793, -1362.07828775, -1354.64948962, -1335.16228368,
       -1266.08438327, -1241.66885913, -1224.46540826, -1212.63352355,
       -1209.25294608, -1206.24049872, -1198.70758681, -1157.01770245])
In [21]:
#sns.boxplot(df_scores);
px.box(df_scores)
In [22]:
threshold=np.sort(df_scores)[5]
print(threshold)
df_clean = df_clean.loc[df_scores > threshold]
df_clean = df_clean.reset_index(drop=True)
-1634.288176471475
In [23]:
df_clean.shape
Out[23]:
(12837, 20)

[5 points] Are there other features that could be added to the data or created from existing features? Which ones?¶

For prediction purposes, we have categorized salary and grouped players salaries based on the salary range (Low - below 50%, Medium - 50%-75%, High-+75%, Elite)¶

  • Low (0-1,999,999)
  • Medium (2-5,999,999)
  • High (6,000,000 - 14,999,999)
  • Elite (+15,000,000)
In [24]:
df_clean['salary_cut'] = pd.cut(df_clean['salary'], bins = [0,1999999,5999999,14999999,50000000], labels=["Low", "Medium", "High", "Elite"], right=True)
df_clean['salary_cut_Numeric'] = pd.cut(df_clean['salary'], bins = [0,1999999,5999999,14999999,50000000], labels=[0, 1, 2, 3], right=True)


df_clean.head()
Out[24]:
yearID stint G AB R H 2B 3B HR RBI ... BB SO IBB HBP SH SF GIDP salary salary_cut salary_cut_Numeric
0 2006 1 5 3 0 0 0 0 0 0.0 ... 2 0.0 0.0 0.0 0.0 0.0 0.0 327000.0 Low 0
1 2011 1 29 0 0 0 0 0 0 0.0 ... 0 0.0 0.0 0.0 0.0 0.0 0.0 418000.0 Low 0
2 2012 1 37 7 0 1 0 0 0 0.0 ... 0 3.0 0.0 0.0 0.0 0.0 1.0 485000.0 Low 0
3 2014 1 69 0 0 0 0 0 0 0.0 ... 0 0.0 0.0 0.0 0.0 0.0 0.0 525900.0 Low 0
4 2015 1 62 0 0 0 0 0 0 0.0 ... 0 0.0 0.0 0.0 0.0 0.0 0.0 1087500.0 Low 0

5 rows × 22 columns

Two other metrics can be derived from this¶

  • On Base Percentage (OBP), quantity of getting on base per at bat
  • Slugging Percentage (SLG), quality of hits per at bat
In [25]:
#add new columns
df_clean["OBP"] = np.where((df_clean["AB"] +  df_clean["IBB"] + df_clean["BB"] + df_clean["HBP"] + df_clean["SF"]) !=0, 
                           (df_clean["H"] +  df_clean["IBB"] + df_clean["BB"] + df_clean["HBP"])/(df_clean["AB"] +  df_clean["IBB"] + df_clean["BB"] + df_clean["HBP"] + df_clean["SF"]),
                            0)
df_clean["SLG"] = np.where(df_clean["AB"] !=0,
                           (df_clean["H"] + df_clean["2B"]*2 + df_clean["3B"]*3 + df_clean["HR"]*4)/df_clean["AB"],
                           0)
df_clean.head()
Out[25]:
yearID stint G AB R H 2B 3B HR RBI ... IBB HBP SH SF GIDP salary salary_cut salary_cut_Numeric OBP SLG
0 2006 1 5 3 0 0 0 0 0 0.0 ... 0.0 0.0 0.0 0.0 0.0 327000.0 Low 0 0.400000 0.000000
1 2011 1 29 0 0 0 0 0 0 0.0 ... 0.0 0.0 0.0 0.0 0.0 418000.0 Low 0 0.000000 0.000000
2 2012 1 37 7 0 1 0 0 0 0.0 ... 0.0 0.0 0.0 0.0 1.0 485000.0 Low 0 0.142857 0.142857
3 2014 1 69 0 0 0 0 0 0 0.0 ... 0.0 0.0 0.0 0.0 0.0 525900.0 Low 0 0.000000 0.000000
4 2015 1 62 0 0 0 0 0 0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1087500.0 Low 0 0.000000 0.000000

5 rows × 24 columns

In [26]:
df_clean.max()
Out[26]:
yearID                      2015
stint                          5
G                            163
AB                           716
R                            152
H                            262
2B                            59
3B                            23
HR                            73
RBI                        160.0
SB                          78.0
CS                          24.0
BB                           232
SO                         223.0
IBB                        120.0
HBP                         30.0
SH                          24.0
SF                          16.0
GIDP                        32.0
salary                33000000.0
salary_cut                 Elite
salary_cut_Numeric             3
OBP                          1.0
SLG                          5.0
dtype: object

[15 points] Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.¶

Utilizing domain knowledge of the sport, we prioritized attributes to compare against each salary group and explored possible correlations.¶

The following boxplot shows the salary distribution for the clean dataframe.

In [27]:
# Create a histogram salaries from 2008.
#plt.hist((df_clean['salary']/1e6), bins=6, color='g', edgecolor='black', linewidth=1.2, align='mid');
#plt.xlabel('salary (millions of $)'), plt.ylabel('Count')
#plt.title('MLB  Salary Distribution', size = 14);

#Plotly Histogram - Which one????
px.histogram(df_clean['salary']/1e6, x= "salary",
             nbins=20,
             title = 'MLB Salary Distribution',
             labels= {'salary': 'Salary (Millions of $)'})

The below boxplot shows that the median number of games played in the Elite salary group is 48 higher than the Low salary group. This may suggest that experience or seniority has an efect on skill level and salary.

In [28]:
## G by salary cut
px.box(df_clean,
       x="salary_cut", y="G", 
       color="salary_cut",
       title = "Number of Games by Salary Cut",
       labels={'salary_cut': 'Salary Cut',
               'G': 'Number of Games'})

The below boxplot demonstrates OBP vs Salary Cut. It is shown that the median On Base Percentage slightly increases as salary increases, with the Low median equal to 0.31 and the Elite median equal to 0.34.

In [29]:
## OBP by salary cut
#sns.boxplot( x="salary_cut", y="OBP", data=df_clean).set(title = 'OBP by Salary Cut')
px.box(df_clean,
       x="salary_cut", y="OBP", 
       color="salary_cut",
       title = "On Base Percentage by Salary Cut",
       labels={'salary_cut': 'Salary Cut',
               'OBP': 'On Base Percentage (OBP)'})

The following boxplot shows a slight increase in median slugging percentage with an increase in salary. There appears to be errors on all groups. The slugging percentage should not be greater than 1. These outliers should be explored prior to analysis.

In [30]:
## SLG by salary cut
#sns.boxplot(x="salary_cut", y="SLG", data=df_clean).set(title = 'SLG by Salary Cut')
px.box(df_clean,
       x="salary_cut", y="SLG",
       color="salary_cut",
       title = "SLG by Salary Cut",
       labels={'salary_cut': 'Salary Cut',
               'SLG': 'Slugging Percentage'})

The below boxplot reflects the defined salary cut groupings and gives insight on each group.

In [31]:
## salary ranges by salary cut
#sns.boxplot(x="salary_cut", y="salary", data=df_clean).set(title = 'Salary by Salary Cut')
px.box(df_clean,
       x="salary_cut", y="salary",
       color="salary_cut",
       title = "Salary by Salary Cut",
       labels={'salary_cut': 'Salary Cut',
               'salary': 'Salary'})

The scatterplot below demonstrates the relationship between salary and on base percentage. It shows that the OBP clusters get tighter as salary increases. The outliers and / or errors should be explored prior to any analysis.

In [32]:
## scatterplot ranges by salary cut
#sns.scatterplot(x="salary", y="OBP", hue="salary_cut", data=df_clean)
px.scatter(df_clean,
           x="salary", y="OBP",
           color="salary_cut",
           title="On Base Percentage by Salary",
           labels={'OBP': 'On Base Percentage',
                  'salary_cut': 'Salary Cut',
                   'salary': 'Salary'})

The following scatterplot shows the correlation between runs batted in and the number of homeruns. Filtering the plot, shows that the linear trend and constant variance among all salary groupings.

In [33]:
## scatterplot ranges by salary cut
#sns.scatterplot(x="HR", y="RBI", hue="salary_cut", data=df_clean)
px.scatter(df_clean,
           x="HR", y="RBI",
           color="salary_cut",
           title="Runs Batted In by Homeruns",
           labels={'RBI': 'Runs Batted In',
                   'HR': 'Homeruns',
                   'salary_cut': 'Salary Cut'})

The next scatterplot shows the relationship between salary and slugging percentage. It behaves in a similar fashion to the On Base Percentage by Salary scatterplot in that the clustering gets tighter as the salary progresses.

In [34]:
## scatterplot ranges by salary cut
#sns.scatterplot(x="salary", y="SLG", data=df_clean)
px.scatter(df_clean,
           x="salary", y="SLG",
           color="salary_cut",
           title="Slugging Perccentage by Salary",
           labels={'salary': 'Salary',
                   'SLG':"Slugging Perccentage",
                   'salary_cut': 'Salary Cut'})

The below 3D plot shows the correlation between runs batted in, on base percentage, and number of hits. The size of each bubble represents the number of homeruns.

In [35]:
px.scatter_3d(df_clean,
              x="RBI", y="OBP",z="H",
              color="salary_cut",
              size="HR",
              title="Runs Batted In (RBI) vs On Base Percentage (OBP) vs Hits (H), Sized by NUmber of Homeruns",
              labels= {'salary_cut': 'Salary Cut'})

Important (Transformation of data and identification of outlier)¶

We observe that there are players that have salaries but no offensive stats. What we realized is that pitchers are also include in this dataset and would be outliers in this analysis.

To ensure that we do not sku or analysis, we will focus on salary on offsensive players. Since we do not have player position in the data set, we will use OBP = 0 (no hitting statistics) to assume that these players are pitchers.

In [36]:
# Delete rows where case numbers are zero
# This deletion is completed by "selecting" rows where case numbers are non zero
df_clean = df_clean.loc[df_clean["OBP"] != 0]
df_clean.shape
Out[36]:
(8754, 24)

[15 points] Explore Joint Attributes - Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.¶

In [37]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
f, ax = plt.subplots(figsize=(15, 15))

sns.heatmap(df_clean.corr(), cmap=cmap, annot=True)
f.tight_layout()
In [38]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

df_clean2 = df_clean.loc[:,~df_clean.columns.isin([ '2B', '3B', 'HR', 'AB'])]
sns.pairplot(df_clean2, hue="salary_cut", height=2)
Out[38]:
<seaborn.axisgrid.PairGrid at 0x18f42db8dc8>

Intepretation of Joint Attributes¶

Based on both the correlation plot and scatter plots, we definetly see the following

  • Hits are highly correlated to doubles, triples, hrs and RBIs. This make sense since the others are classification of the type of hit. The scatter plot we observe the observations trending positive and correlation table show >0.8 for many of these offensive hitting stats
  • We have lower salary players to have limited offensive stats based on the distribution which skews right more vs others.
  • We do not see good seperation of a players hitting stats to salary. Fielding statistics are other variables we did not consider that may result in better reasoning why a player receives a higher salary.

[10 points] Explore Attributes and Class - Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).¶

In [39]:
column = ['H', 'SLG', 'OBP', 'HR', 'RBI']
for col in column:
    plt.subplots(figsize=(20, 8))
    sns.violinplot(x="salary_cut", hue="salary_cut", y=col, data=df_clean,
            kind='violin', # other options: violin, bar, box, and others 
               palette='PRGn',
               height=7,ci=95)
  1. The first graph is based on the distribution of hits by the salary cut. This data shows that low salary players dont have hits or are very low as it looks like its more leaning toward 0. As you go up the salary cuts, you can see that that hits are getting more bimodal and we are seeing a secondary peak at 175.

  2. The second graph is based on the distribution of SLG by the salary cut. This data shows that high and elite paid players have very similar SLG distribution. Other things like variables like field statistic could differentiate these players futher that were not part of this analysis.

  3. The third graph is based on the distribution of OBP by the salary cut. This data shows that high and elite paid players have very similar SLG distribution. Other things like variables like field statistic could differentiate these players futher that were not part of this analysis.

  4. The fourth graph is based on the distribution of HomeRuns by the salary cut. This data shows that low salary players dont have homeruns or are very low as it looks like its more leaning toward 0. As you go up the salary cuts, you can see that that homeruns are getting more bimodal and we are seeing a secondary peak at 30.

  5. The fifth graph is based on the distribution of RBIs by the salary cut. This data shows that low salary players dont have RBI's or are very low as it looks like its more leaning toward 0. As you go up the salary cuts, you can see that that RBIs are getting fatter between the 50 to 100 range.


[10 points] Exceptional Work - You have free reign to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.¶

We are looking to use PCA to reduce dimensions. below is the code.¶

In [40]:
# Here we will use PCA for dimensionality Reduction.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import numpy
import matplotlib.pyplot as plot
In [41]:
# You must normalize the data before applying the fit method


df_PCA = df_clean.loc[:,~df_clean.columns.isin(['playerID', 'stint', 'teamID','lgID','salary','salary_cut', 'salary_cut_Numeric', '2B', '3B', 'HR', 'AB'])]
df_PCA_normalized = (df_PCA - df_PCA.mean())/ df_PCA.std()
pca = PCA(n_components=df_PCA.shape[1])
pca.fit(df_PCA_normalized)
Out[41]:
PCA(n_components=16)
In [42]:
# Reformat and view results
loadings = pd.DataFrame(pca.components_.T,
columns=['PC%s' % _ for _ in range(len(df_PCA_normalized.columns))],
index=df_PCA.columns)
print(loadings)
             PC0       PC1       PC2       PC3       PC4       PC5       PC6  \
yearID -0.008901 -0.040801 -0.425967  0.858662 -0.048549  0.232788  0.025705   
G       0.324420 -0.059554 -0.060321 -0.008147 -0.094698 -0.001422 -0.088874   
R       0.334186 -0.065055 -0.009104 -0.026807  0.024623 -0.023152 -0.013108   
H       0.333427 -0.067219 -0.048140 -0.013319 -0.043404 -0.021461 -0.103177   
RBI     0.325746  0.096944 -0.129968 -0.083618 -0.004067 -0.021158 -0.083931   
SB      0.187306 -0.460372  0.328737  0.211633  0.269830 -0.149503  0.020630   
CS      0.205413 -0.441717  0.312554  0.157470  0.195548 -0.169695  0.000791   
BB      0.310347  0.060454 -0.059520 -0.050941  0.208172  0.110831  0.090298   
SO      0.297771 -0.038630 -0.177738  0.069143 -0.051413 -0.055656  0.025003   
IBB     0.207870  0.229141 -0.110264 -0.145873  0.633092  0.458901  0.301060   
HBP     0.227067 -0.013720 -0.060987 -0.052553 -0.427354 -0.197095  0.810689   
SH     -0.032789 -0.539919 -0.004463 -0.241925 -0.340085  0.723632 -0.015357   
SF      0.269900  0.042612 -0.161400 -0.112905 -0.106954 -0.058005 -0.346518   
GIDP    0.277258  0.061816 -0.202465 -0.085261 -0.174564 -0.055060 -0.278578   
OBP     0.147237  0.331524  0.551140  0.227766 -0.143847  0.244581  0.030975   
SLG     0.190279  0.323551  0.409920  0.152711 -0.259698  0.196314 -0.132832   

             PC7       PC8       PC9      PC10      PC11      PC12      PC13  \
yearID -0.047531  0.070211  0.072329 -0.006102  0.082508 -0.045151 -0.010539   
G      -0.037211 -0.159018 -0.089627 -0.026453 -0.276570 -0.369297  0.723044   
R       0.052173 -0.046977 -0.072740  0.192712  0.233202 -0.303457 -0.222097   
H      -0.048840 -0.093236  0.076436  0.126309  0.007354 -0.401241 -0.068613   
RBI     0.077025  0.017960 -0.041650  0.079085  0.044112 -0.272401 -0.524190   
SB     -0.005245  0.118673  0.095730  0.609859 -0.166458  0.252253  0.030747   
CS     -0.023602 -0.006631  0.160584 -0.722041  0.115735 -0.053018 -0.085690   
BB      0.060376 -0.080228 -0.349951  0.024729  0.700104  0.311951  0.273813   
SO      0.349519 -0.192785 -0.492946 -0.167943 -0.490595  0.369833 -0.179065   
IBB    -0.034845  0.092351  0.277740 -0.081086 -0.265794  0.005514  0.003435   
HBP    -0.081712  0.158443  0.163049  0.000043  0.011717  0.060422  0.018270   
SH      0.002425  0.005891 -0.036381  0.006191  0.018175  0.038798 -0.068027   
SF     -0.247870  0.793360 -0.050400 -0.103713 -0.056864  0.182372  0.059063   
GIDP   -0.281245 -0.452848  0.525610  0.006991  0.018547  0.436380 -0.033567   
OBP    -0.554420 -0.110210 -0.291875 -0.018781 -0.103187  0.026670 -0.121799   
SLG     0.634906  0.143269  0.323125 -0.007684  0.055205  0.064670  0.092390   

            PC14      PC15  
yearID -0.021943 -0.020074  
G      -0.264310 -0.172349  
R       0.451996 -0.659179  
H       0.416050  0.705050  
RBI    -0.697652  0.034812  
SB     -0.127462  0.028249  
CS     -0.041164 -0.013947  
BB     -0.091850  0.161530  
SO      0.162019  0.028267  
IBB     0.062995 -0.026456  
HBP    -0.017709  0.016580  
SH     -0.038237 -0.000385  
SF      0.083584 -0.023196  
GIDP    0.000561 -0.086900  
OBP     0.000035 -0.006559  
SLG     0.013560  0.008776  
In [43]:
plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance')
plot.xlabel('Components')
plot.show()

It looks like we would need 3 PC components to explain 80% of the variance.

  • PC0 is based on the hitting metrics like hits, HRs and RBIS that represents this PC
  • PC1 is based on the speed of the player as stolen bases (SB), caught stealing (CS) and sacrific hits (SH) that reprsents this PC -PC2 is based on the quality and quantity of the hits of a offensive player since SLG and OBP are metrics that represents this PC.

Retrying PCA Analysis¶

Rerunning pcA and LDA to have the target be used to figure out what components are needed. The analysis will used the 3 PC that we identified to complete this analysis.

from sklearn.manifold import TSNE from sklearn.decomposition import PCA from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from sklearn.preprocessing import StandardScaler from matplotlib.pyplot import scatter import plotly from plotly.graph_objs import Scatter, Marker, Layout, layout,XAxis, YAxis, Bar, Line %matplotlib inline

In [44]:
# removing unnessary columns from the df

df_DR = df_clean.loc[:,~df_clean.columns.isin(['playerID', 'stint', 'teamID','lgID','salary_cut', 'salary', '2B', '3B', 'HR', 'AB'])]

# Set our target as the 'charges'
target = df_DR['salary_cut_Numeric']
target_names='salary_cut_Numeric'
Y = target

# Delete the column of target from our table
df_DR = df_DR.drop("salary_cut_Numeric",axis=1)

X = df_DR.values
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit(X).transform(X) # fit data and then transform it


lda = LDA(n_components=3)
X_lda = lda.fit(X, Y).transform(X) # fit data and then transform it

# print the components

print ('pca:', pca.components_)
print ('lda:', lda.scalings_.T)
pca: [[-0.0089013   0.32441973  0.33418634  0.33342664  0.32574579  0.18730595
   0.20541281  0.31034681  0.29777111  0.20787009  0.22706734 -0.03278896
   0.2699004   0.27725781  0.14723717  0.19027917]
 [-0.04080066 -0.05955397 -0.0650549  -0.06721933  0.09694379 -0.46037161
  -0.44171758  0.06045535 -0.03863086  0.22914062 -0.01371994 -0.53991885
   0.04261204  0.06181653  0.33152366  0.32355106]
 [-0.42596662 -0.06032014 -0.00910467 -0.04814028 -0.12996987  0.32873751
   0.31255342 -0.05951744 -0.17773947 -0.11026498 -0.06098641 -0.00446316
  -0.16140007 -0.20246507  0.55113976  0.4099207 ]]
lda: [[ 0.41780707 -1.61275672  0.11305793  0.71965124  0.76278146  0.08773866
  -0.2566929   0.71432957 -0.24336354  0.2029815   0.01222968  0.19180684
   0.003746    0.29643762 -0.05493118 -0.05489834]
 [-0.48394617 -0.64415856 -1.22405547  1.92064957  0.11561116 -0.14934072
   0.2842146   0.50869695 -0.0476467  -0.68400953 -0.01215602  0.27954507
   0.30701564 -0.02748849  0.0190746  -0.17976016]
 [ 0.42187378  0.49169227 -1.22036286  1.75266478 -0.7163932  -0.09239516
  -0.08272393 -0.88521599  0.12664837  0.49132436  0.26225622  0.32058846
   0.23082037 -0.03381534  0.18119239 -0.15068359]]
In [45]:
# this function definition just formats the weights into readable strings (from class notes).
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx,f in enumerate(names):
            if fidx>0 and comp[fidx]>=0:
                tmp_string+='+'
            tmp_string += '%.2f*%s ' % (comp[fidx],f[:8])
        tmp_array.append(tmp_string)
    return tmp_array
In [46]:
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_DR.columns) 

# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_pca[:,0], X_pca[:,1], c=Y)
plt.xlabel('Principal Component 1', fontsize=30)
plt.ylabel('Principal Component 2', fontsize=30)
plt.title('Principal Component Analysis 1', fontsize=50)

plt.tick_params(axis='both', which='major', labelsize=15)
plt.tick_params(axis='both', which='minor', labelsize=15)
In [47]:
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_DR.columns) 

# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_pca[:,1], X_pca[:,2], c=Y)
plt.xlabel('Principal Component 2', fontsize=30)
plt.ylabel('Principal Component 3', fontsize=30)
plt.title('Principal Component Analysis 2', fontsize=50)

plt.tick_params(axis='both', which='major', labelsize=15)
plt.tick_params(axis='both', which='minor', labelsize=15)
In [48]:
print('\033[1m' + 'Principal Component 1: ' + '\033[0m' ,pca_weight_strings[0])
print('\033[1m' + '\nPrincipal Component 2: \n' + '\033[0m',pca_weight_strings[1])
print('\033[1m' + '\nPrincipal Component 3: \n' + '\033[0m',pca_weight_strings[2])
Principal Component 1:  -0.01*yearID +0.32*G +0.33*R +0.33*H +0.33*RBI +0.19*SB +0.21*CS +0.31*BB +0.30*SO +0.21*IBB +0.23*HBP -0.03*SH +0.27*SF +0.28*GIDP +0.15*OBP +0.19*SLG 

Principal Component 2: 
 -0.04*yearID -0.06*G -0.07*R -0.07*H +0.10*RBI -0.46*SB -0.44*CS +0.06*BB -0.04*SO +0.23*IBB -0.01*HBP -0.54*SH +0.04*SF +0.06*GIDP +0.33*OBP +0.32*SLG 

Principal Component 3: 
 -0.43*yearID -0.06*G -0.01*R -0.05*H -0.13*RBI +0.33*SB +0.31*CS -0.06*BB -0.18*SO -0.11*IBB -0.06*HBP -0.00*SH -0.16*SF -0.20*GIDP +0.55*OBP +0.41*SLG 

Interpretation of PCA¶

The principal component 1, 2, and 3 are a linear sum of the different features.

In principal component 1, games played, plate appearances, scores, hits, RBI, walks are the major components of a players offsensive raw stats.

In principal component 2, stolen bases, caught stealing, intentional walks, sacrifice hits are the major components of players speed.

In principal component 3, as noted previously is based on the quality and quantity of hits that were depicted with OBP and SLG.

Of the 3 components, Component 1 is the most important to separate the salary classes of intersection (low, medium, high and elite ) from one another. Our observation would suggest that values greater than 2.5 we can a player will be likely to have high (green) and elite (yellow) salaries (greater than 6 million per year in salary). We also noted that when the offensive stats noted in PC1 are less than 2.5 your likelyhood salary will be in the range of low to medium(less than 6 million per year in salary)

In [49]:
#LDA graphs

lda_weight_strings = get_feature_names_from_weights(lda.scalings_.T, df_DR.columns) 

# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_lda[:,0], X_lda[:,1], c=Y)
plt.xlabel('Component 1', fontsize=30)
plt.ylabel('Component 2', fontsize=30)
plt.title('Linear Discriminant Analysis 1', fontsize=50)

plt.tick_params(axis='both', which='major', labelsize=20)
plt.tick_params(axis='both', which='minor', labelsize=20)
In [50]:
#LDA graphs

lda_weight_strings = get_feature_names_from_weights(lda.scalings_.T, df_DR.columns) 

# Scatter plot the output
plt.style.use('default')
f, ax = plt.subplots(figsize=(30, 30))
ax = scatter(X_lda[:,0], X_lda[:,2], c=Y)
plt.xlabel('Component 1', fontsize=30)
plt.ylabel('Component 3', fontsize=30)
plt.title('Linear Discriminant Analysis 2', fontsize=50)

plt.tick_params(axis='both', which='major', labelsize=20)
plt.tick_params(axis='both', which='minor', labelsize=20)
In [51]:
print('\033[1m' + 'Component 1: ' + '\033[0m' ,lda_weight_strings[0])
print('\033[1m' + '\nComponent 2: \n' + '\033[0m',lda_weight_strings[1])
print('\033[1m' + '\nComponent 3: \n' + '\033[0m',lda_weight_strings[2])
Component 1:  0.42*yearID -1.61*G +0.11*R +0.72*H +0.76*RBI +0.09*SB -0.26*CS +0.71*BB -0.24*SO +0.20*IBB +0.01*HBP +0.19*SH +0.00*SF +0.30*GIDP -0.05*OBP -0.05*SLG 

Component 2: 
 -0.48*yearID -0.64*G -1.22*R +1.92*H +0.12*RBI -0.15*SB +0.28*CS +0.51*BB -0.05*SO -0.68*IBB -0.01*HBP +0.28*SH +0.31*SF -0.03*GIDP +0.02*OBP -0.18*SLG 

Component 3: 
 0.42*yearID +0.49*G -1.22*R +1.75*H -0.72*RBI -0.09*SB -0.08*CS -0.89*BB +0.13*SO +0.49*IBB +0.26*HBP +0.32*SH +0.23*SF -0.03*GIDP +0.18*OBP -0.15*SLG 

Interpretation of LDA¶

In component 1, games played, plate appearances, scores, hits, RBI, walks are the major components of a players offsensive raw stats.

In component 2, stolen bases, caught stealing, intentional walks, sacrifice hits are the major components of players speed.

In component 3, as noted previously is based on the quality and quantity of hits that were depicted with OBP and SLG.

Based on the LDA we note that the prediction of elite salary is based on the raw offesnive stastics and quality of hits. We observe that with component 1 is greater 2.5 and component 3 is greater 0, we observe that a offensive player to obtain salaries of high to elite classes.

Conclusion¶

In conclusion, for offensive players to obtain the best salary from a team, players need to perform at least the 75% percentile of all players stats to be considered by team GMs with large salary contracts.

Opportunities of future analysis improvement

  • Include players positions to omit pitchers from this analysis when determining the salary class of offensive players
  • Include players fielding statistics in combination of offensive statistics.
In [ ]: